Introduction

The goal of this analysis is to find a model that fits the observed cumulative cases of COVID-19 in the US from mid-July 2021 through late November 2021. Mid-July coincides with the introduction and subsequent dominance of the delta variant of COVID-19, and late November is the most recent data released by the World Health Organization (WHO) at the time of writing. The data used for this project are made freely available by the WHO here. The first section covers exploratory data analysis and formatting. The second section contains several helper functions for plotting data and results. The third section defines the candidate models in increasing order of complexity. The final section presents the model comparison and conclusion.

Observations:

Several Helper Functions for Plotting

Candidate Models

For each model, the trace plot, energy plot, and model summary are shown. Model summaries include the mean and credible set for each parameter. Each parameter is defined in the cell that begins with "with pm.Model() as ..." and is described briefly in the comment above it. After each model is sampled and summarized, the model's posterior is plotted on both log and linear scales, overlaid on the original data. Residuals are plotted below.

Model Comparison and Conclusion

Conclusion

Both WAIC and LOO prefer the simplest (linear) model to all others, even though the log-scale posterior predictive charts appear to favor the logistic model. Each model has a characteristic flaw rooted in the shape of curve it can produce. The linear model cannot predict any change in slope over time, while the exponential model can never predict a reduction in slope. The observed cumulative case data, however, resemble exponential growth early on followed by a leveling off. By this argument, we would expect the logistic model to perform best. But because of the "S" shape of the sigmoid family of curves, the logistic model must eventually predict a permanent leveling off of cumulative cases after the exponential increase. This leads to a poor fit for this particular time frame, in which the slope increases a second time after the initial leveling off.
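The shape argument above can be checked numerically. The following is a minimal sketch, not the notebook's PyMC models: it evaluates one illustrative mean curve per family and compares the day-to-day slopes. The function names and every parameter value are assumptions chosen only for illustration.

```python
import math


def linear(t, intercept, slope):
    """Cumulative cases growing by a fixed amount per day."""
    return intercept + slope * t


def exponential(t, intercept, growth):
    """Cumulative cases growing by a fixed fraction per day."""
    return intercept * (1.0 + growth) ** t


def logistic(t, capacity, growth, midpoint):
    """S-shaped curve that saturates at `capacity`."""
    return capacity / (1.0 + math.exp(-growth * (t - midpoint)))


def daily_slopes(curve, days):
    """Day-to-day change of a curve evaluated on consecutive days."""
    values = [curve(t) for t in days]
    return [later - earlier for earlier, later in zip(values, values[1:])]


days = range(0, 120)

# Hypothetical parameter values, not fitted to the WHO data.
linear_slopes = daily_slopes(lambda t: linear(t, 34e6, 1e5), days)
exponential_slopes = daily_slopes(lambda t: exponential(t, 34e6, 0.003), days)
logistic_slopes = daily_slopes(lambda t: logistic(t, 14e6, 0.08, 50.0), days)

# Position of the steepest day of the logistic curve.
logistic_peak_day = logistic_slopes.index(max(logistic_slopes))

print("linear slope range:", min(linear_slopes), max(linear_slopes))
print("exponential slope first/last:", exponential_slopes[0], exponential_slopes[-1])
print("logistic steepest day:", logistic_peak_day)
```

Under these assumed values the linear slope never changes, the exponential slope only grows, and the logistic slope rises to a single maximum near its midpoint and then declines for good. None of the three shapes can level off and then steepen again.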

As seen in the charts above, new infections increase exponentially early on, peak in early September, and then decrease just as rapidly until late October. This gives cumulative cases a sigmoid (logistic-like) shape over this period. However, new cases begin to increase again in late October, which produces the second increase in the slope of cumulative cases. This second rise in the rate of new cases limits the effectiveness of a logistic model, which cannot represent a second exponential increase.
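The relationship between new cases and cumulative cases described above can be illustrated with synthetic data. The sketch below is an assumption-laden toy, not the WHO series: it builds one cumulative curve from a single logistic wave and another from two overlapping logistic waves, differences each to get daily new cases, and then locates the days on which the trend in new cases reverses. All names and parameter values are hypothetical.

```python
import math


def logistic(t, capacity, growth, midpoint):
    """S-shaped cumulative curve for one wave of infections."""
    return capacity / (1.0 + math.exp(-growth * (t - midpoint)))


def new_cases(cumulative):
    """Daily new cases are the day-to-day differences of cumulative cases."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def turning_points(series):
    """Indices at which the series switches between rising and falling."""
    points = []
    previous_direction = 0
    for i in range(1, len(series)):
        change = series[i] - series[i - 1]
        direction = (change > 0) - (change < 0)
        if direction == 0:
            continue  # flat step: keep the previous direction
        if previous_direction != 0 and direction != previous_direction:
            points.append(i - 1)
        previous_direction = direction
    return points


days = range(0, 136)   # roughly mid-July to late November
baseline = 34e6        # hypothetical cumulative count on day 0

# One wave: what a single logistic curve is able to describe.
one_wave = [baseline + logistic(t, 5e6, 0.10, 50.5) for t in days]

# Two waves: a second, later rise added on top of the first.
two_waves = [
    baseline + logistic(t, 5e6, 0.10, 50.5) + logistic(t, 6e6, 0.07, 150.5)
    for t in days
]

one_wave_turns = turning_points(new_cases(one_wave))
two_wave_turns = turning_points(new_cases(two_waves))

print("single logistic, turning points in new cases:", one_wave_turns)
print("two waves, turning points in new cases:", two_wave_turns)
```

With these assumed values, new cases from the single logistic curve have exactly one reversal (the peak), while the two-wave series has a peak followed by a trough after which new cases climb again. A single logistic curve has no way to produce that second reversal.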

To improve the fit further, one could try an autoregressive time-series model that accounts for seasonal changes, public policy changes, and other external factors that make this problem fundamentally non-stationary.

If the hyperlink above does not work, the data source is: https://covid19.who.int/info?openIndex=2

Inspiration for this project was taken from the talk "The Bayesian Workflow: Building a COVID-19 Model" by Thomas Wiecki, available here: https://discourse.pymc.io/t/the-bayesian-workflow-building-a-covid-19-model-by-thomas-wiecki/6017 . I use different methods and data, and ultimately come to a different conclusion than the original talk.